
[TRTLLM-11037][bug] Fix MoE DeepEP hang caused by non-deterministic GC#12060

Open
xxi-nv wants to merge 1 commit into NVIDIA:main from xxi-nv:fix_moe_bugs

Conversation


@xxi-nv xxi-nv commented Mar 10, 2026

Summary

  • Fix MoE DeepEP hang caused by non-deterministic garbage collection timing across ranks
  • DeepEP Buffer.__del__ calls intranode::barrier (a collective op). Without explicit synchronous release, GC timing differences across ranks cause some ranks to block in the barrier indefinitely
  • Add destroy() method to Communication base class and DeepEP/DeepEPLowLatency subclasses for explicit buffer release
  • Call destroy() in ConfigurableMoE when falling back from DeepEP to AllGatherReduceScatter
  • Make ConfigurableMoE a context manager so resources are released on scope exit
  • Re-enable DEEPEPLOWLATENCY multi-GPU tests on H100 that were previously skipped due to this hang
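The explicit-release pattern the bullets above describe can be sketched as follows. This is a minimal illustration, not the actual TensorRT-LLM implementation: class names, the `buffer` attribute, and the stand-in resource are all hypothetical.

```python
# Sketch of the explicit destroy() + context-manager pattern described
# above. Names are illustrative; the real classes live in
# tensorrt_llm/_torch/modules/fused_moe/.

class Communication:
    """Base class: destroy() is a safe no-op unless a subclass holds resources."""

    def destroy(self):
        pass


class DeepEPLike(Communication):
    """Holds a buffer whose destructor performs a collective barrier.

    Relying on __del__ is unsafe here: GC runs at different times on
    different ranks, so a collective barrier inside __del__ can block
    some ranks forever. destroy() releases the buffer at a program
    point every rank reaches together.
    """

    def __init__(self):
        self.buffer = object()  # stand-in for the DeepEP buffer

    def destroy(self):
        # Drop the reference deterministically so the (collective)
        # teardown happens at a known point on every rank.
        self.buffer = None


class ConfigurableMoELike:
    """Context-manager wrapper so resources are freed on scope exit."""

    def __init__(self, comm):
        self.comm = comm

    def destroy(self):
        if self.comm is not None:
            self.comm.destroy()
            self.comm = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()
        return False  # do not swallow exceptions


comm = DeepEPLike()
with ConfigurableMoELike(comm) as moe:
    pass  # forward passes would go here
assert moe.comm is None and comm.buffer is None
```

The same `destroy()` call is what the fallback path uses: before swapping the strategy for AllGatherReduceScatter, the old strategy is destroyed explicitly instead of being left to the garbage collector.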

Test plan

  • Unit tests updated to use context manager pattern (with create_moe(...) as fused_moe)
  • Multi-GPU MoE tests with DEEPEPLOWLATENCY on H100 (previously hanging, now expected to pass)
  • CI validation
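The test-plan pattern above can be illustrated with a toy factory. `create_moe` here is a hypothetical stand-in for the test utility named in the plan; only the shape of the usage (`with create_moe(...) as fused_moe`) mirrors the updated tests.

```python
# Toy sketch of the context-manager test pattern; create_moe is a
# hypothetical stand-in, not the real test-utility factory.
from contextlib import contextmanager

@contextmanager
def create_moe(**kwargs):
    moe = {"destroyed": False, **kwargs}
    try:
        yield moe
    finally:
        # Cleanup runs on scope exit even if the test body raises,
        # which is what prevents lingering buffers between tests.
        moe["destroyed"] = True

with create_moe(num_experts=8) as fused_moe:
    assert not fused_moe["destroyed"]
assert fused_moe["destroyed"]
```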

Summary by CodeRabbit

  • New Features

    • MoE communication components now support context manager syntax (with statement) for streamlined resource management.
    • Added explicit lifecycle control methods to communication layer components.
  • Tests

    • Enhanced test coverage for communication subsystem reliability and low-latency scenarios.

Fix multi-GPU hangs caused by non-deterministic GC of DeepEP buffers by
adding explicit destroy() to Communication classes and calling it before
fallback to AllGatherReduceScatter.

Also add explicit comm resource cleanup in test teardown to prevent
inter-test barrier deadlock from lingering DeepEP buffers.

Remove the now-unnecessary DEEPEPLOWLATENCY skip on H100 (SM90).

Signed-off-by: xxi <xxi@nvidia.com>
@xxi-nv xxi-nv requested a review from a team as a code owner March 10, 2026 03:10
@xxi-nv xxi-nv requested a review from HuiGao-NV March 10, 2026 03:10
@xxi-nv xxi-nv requested review from QiJune and leslie-fang25 and removed request for HuiGao-NV March 10, 2026 03:13

xxi-nv commented Mar 10, 2026

/bot run


coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 843137a8-c68d-4849-99c3-18560b46e674

📥 Commits

Reviewing files that changed from the base of the PR and between 8d8b84b and 140c9d2.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/modules/fused_moe/communication/base.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
  • tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
  • tests/unittest/_torch/modules/moe/moe_test_utils.py
  • tests/unittest/_torch/modules/moe/test_moe_module.py
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/modules/moe/moe_test_utils.py

📝 Walkthrough

Walkthrough

This pull request adds lifecycle management to TensorRT-LLM's fused MoE communication strategies. A destroy() method is introduced across communication classes (base, DeepEP, DeepEPLowLatency) to explicitly release resources. ConfigurableMoE gains context manager support (__enter__, __exit__) and calls destroy() when switching communication strategies. Test utilities are updated to remove a device-specific skip condition and adjust context manager syntax.

Changes

  • Communication Strategy Lifecycle — tensorrt_llm/_torch/modules/fused_moe/communication/base.py, tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py, tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py
    Added a non-abstract destroy() method to the base Communication class and implemented it in the DeepEP and DeepEPLowLatency subclasses to explicitly release internal buffers and prevent the deadlock/hang from barrier synchronization during garbage collection.
  • ConfigurableMoE Lifecycle & Strategy Management — tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
    Implemented the context manager protocol (__enter__, __exit__) and a destroy() method for ConfigurableMoE. Modified determine_communication_method() to call destroy() on the existing communication strategy before replacing it with AllGatherReduceScatter.
  • Test Utilities & Module Updates — tests/unittest/_torch/modules/moe/moe_test_utils.py, tests/unittest/_torch/modules/moe/test_moe_module.py
    Removed the device-capability skip rule for DEEPEPLOWLATENCY on H100 (SM90). Updated test context manager syntax from nested with blocks to the single-statement (tuple-based) form for improved resource management.
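The "tuple-based" with form mentioned above refers to combining multiple context managers in one statement instead of nesting them. A minimal sketch using stdlib `nullcontext` (the parenthesized multi-line variant of this syntax requires Python 3.10+):

```python
# Nested vs. single-statement context managers; nullcontext stands in
# for the real MoE/test context managers.
from contextlib import nullcontext

# Nested form (old style):
with nullcontext("a") as x:
    with nullcontext("b") as y:
        pair_nested = (x, y)

# Single-statement ("tuple-based") form, as in the updated tests:
with nullcontext("a") as x, nullcontext("b") as y:
    pair_flat = (x, y)

assert pair_nested == pair_flat == ("a", "b")
```

Both forms run the same __enter__/__exit__ calls; the flat form simply reduces indentation and keeps the managers' lifetimes visually paired.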

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 73.33%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the PR title clearly and concisely summarizes the main change (fixing a MoE DeepEP hang caused by non-deterministic GC) with the proper ticket reference [TRTLLM-11037] and type [bug].
  • Description check — ✅ Passed: the description includes all critical sections (Summary, Test plan with checkboxes) and covers the what/why of the changes. While some non-critical template sections (PR Checklist) are absent, the description is substantially complete and follows the intended template structure.


@tensorrt-cicd

PR_Github #38362 [ run ] triggered by Bot. Commit: 140c9d2 Link to invocation

@tensorrt-cicd

PR_Github #38362 [ run ] completed with state SUCCESS. Commit: 140c9d2
/LLM/main/L0_MergeRequest_PR pipeline #29731 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
